Scaling the Stars

Optimizing MPI communication on GPUs in the PROMPI stellar dynamics code

Miren Radia

Research Computing Services, University of Cambridge

Monday 19 May 2025

Introduction

The team

Raphael Hirschi
Professor of Stellar Hydrodynamics and Nuclear Astrophysics
Keele University

Vishnu Varma
Research Associate In Theoretical Stellar Astrophysics
Keele University

Kate Goodman
PhD Student
Keele University

Miren Radia
Research Software Engineer
University of Cambridge

Simon Clifford
Research Software Engineer
University of Cambridge

Others

  • Federico Rizzuti, former PhD Student, Keele University

PROMPI

What does the code do?

  • PROMPI is a fluid dynamics code that is used to simulate complex hydrodynamic processes within stars.
  • Numerical methods:
    • Finite volume
    • Eulerian
    • Piecewise Parabolic Method (PPM) hydrodynamics scheme
  • Physics:
    • Fully compressible fluids
    • Nuclear burning/convection/turbulence
  • Code:
    • Fortran
    • Domain decomposed and distributed with MPI

Evolution of \(|\mathbf{v}|\) for a \(1024^3\) simulation of the Carbon-burning shell

Previous RSE work

What improvements had already been made to the code?

Over several DiRAC RSE projects, the code has been enhanced and modernized in a number of ways:

  • Acceleration on Nvidia GPUs using OpenACC
  • Fortran 77 → Modern free-form Fortran
  • Object-oriented design (Fortran 2003)
  • Legacy include statements and common blocks → Modules
  • Non-standard Makefile build system → CMake
  • Non-standard binary I/O format → HDF5
  • Regression tests and GitLab CI pipeline to run them

Improving MPI communication

Starting place

How was communication handled previously?

Previously the code used:

  • Nvidia managed memory extension to OpenACC:
    • When the host (CPU) or device (GPU) accesses data that currently resides on the other, a page fault triggers the runtime to migrate the data across.
  • MPI derived datatypes:
    • Data arrays have halo regions/ghost cells to allow calculating derivatives.
    • In each direction, these are non-contiguous in memory but regularly spaced.
    • MPI_Type_vector is designed to handle exactly this kind of memory layout. (Figure: non-contiguous memory layout described by an MPI_Type_vector.)
  • Effectively blocking MPI calls:
    • The data for each variable is stored in a separate array.
    • The ghost data for each array was sent in a separate MPI_Isend.
    • However, MPI_Wait was called immediately after every MPI_Irecv, serializing the transfers.
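For illustration, describing one non-contiguous halo face with MPI_Type_vector might look roughly like this (the array name and extents are hypothetical, not taken from PROMPI):

```fortran
! Hypothetical sketch: a y-direction halo of a 3D array u(nx, ny, nz)
! with ng ghost layers. In Fortran's column-major layout, u(1:nx, 1:ng, k)
! is one contiguous run of nx*ng elements for each k, and consecutive
! k-blocks start nx*ny elements apart.
use mpi
integer, parameter :: nx = 64, ny = 64, nz = 64, ng = 2
integer :: halo_type, ierr

call MPI_Type_vector(nz,     &  ! count: number of contiguous blocks
                     nx*ng,  &  ! blocklength: elements per block
                     nx*ny,  &  ! stride between block starts
                     MPI_DOUBLE_PRECISION, halo_type, ierr)
call MPI_Type_commit(halo_type, ierr)
! halo_type can now be passed to MPI_Isend/MPI_Irecv for this face.
```

With managed memory, each of those nz contiguous blocks could end up being migrated host↔device individually, which is the behaviour described on the next slide.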

The problem

How did this configuration lead to poor performance on multiple GPUs?

  • Because of managed memory, each contiguous chunk of the MPI_Type_vector was separately migrated from device to host memory.
  • These were then communicated using MPI from host memory on one rank to another.
  • The effectively blocking MPI calls meant these transfers were likely performed one-by-one.
  • Lots of small host-device transfers (visible in Nvidia Nsight Systems and Linaro Forge profiling) → poor performance.
  • For a \(512^3\) test simulation running on 8 Tursa Nvidia A100s (2 nodes), basic timer profiling showed that > 90% of the walltime was spent in communication.

The solution

How were the communications optimised?

I significantly refactored the communication in the following ways:

  • Manual packing and unpacking of data:

    • Single send/receive buffer for each pair of communicating processes.
    • Asynchronous OpenACC kernels [un]pack the ghost-cell data of all variables from/to the single buffer on the GPU.
    • No more MPI_Type_vector.
  • Forced use of GPU-aware MPI:

    !$acc host_data use_device(send_buf)
    call mpi_isend(send_buf, ...)
    !$acc end host_data
  • MPI_Waitall after all sends and receives for each direction (truly asynchronous).
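Schematically, the refactored pattern looks something like the sketch below. All names (send_buf, recv_buf, nvar, nhalo, neighbour, etc.) and the flattened packing loop are illustrative assumptions, not PROMPI's actual code, which packs from 3D arrays:

```fortran
! Illustrative sketch of the pack -> GPU-aware Isend -> Waitall pattern.
integer :: v, i, ierr
integer :: reqs(2)

! Pack ghost data for all variables into one device buffer, asynchronously.
!$acc parallel loop collapse(2) async(1) present(vars, send_buf)
do v = 1, nvar
  do i = 1, nhalo
    send_buf(i + (v-1)*nhalo) = vars(i, v)
  end do
end do
!$acc wait(1)

! Pass the device pointers straight to GPU-aware MPI: no host staging.
!$acc host_data use_device(send_buf, recv_buf)
call mpi_isend(send_buf, nvar*nhalo, MPI_DOUBLE_PRECISION, &
               neighbour, tag, comm, reqs(1), ierr)
call mpi_irecv(recv_buf, nvar*nhalo, MPI_DOUBLE_PRECISION, &
               neighbour, tag, comm, reqs(2), ierr)
!$acc end host_data

! Wait for all outstanding requests at once rather than one-by-one.
call mpi_waitall(2, reqs, MPI_STATUSES_IGNORE, ierr)

! recv_buf is then unpacked into the ghost cells by a similar OpenACC kernel.
```

The key point is that a single large buffer crosses the network per neighbour, rather than many small strided chunks each triggering its own page migration.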

Results

How much better is the performance following these changes?

After these changes with our test case:

  • ~200x speed-up in communication leading to ~20x overall speed-up.
  • < 10% of the walltime spent in communication.

Scaling

How does PROMPI scale after these improvements?

Weak scaling on Tursa

  • Excellent weak scaling of 88% efficiency up to 128 GPUs.
  • Weak scaling is the most relevant metric for the group, given their typical research workflows.
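For reference, weak scaling efficiency on \(N\) GPUs (problem size per GPU held fixed, assuming a single-GPU baseline) is the usual ratio

\[ E_{\text{weak}}(N) = \frac{T(1)}{T(N)}, \]

where \(T(N)\) is the walltime on \(N\) GPUs, so 88% efficiency at 128 GPUs corresponds to a walltime only about 14% longer than the baseline run.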

Strong scaling on Tursa and COSMA8

  • Good strong scaling (>50% efficiency) up to around 32 Tursa Nvidia A100 80GB GPUs.
  • Efficiency drops for greater numbers due to GPU underutilization.
  • Grey line shows roughly how many COSMA8 (Milan) nodes are equivalent to 1 Tursa GPU.
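By contrast, strong scaling keeps the total problem size fixed. Assuming a single-GPU baseline,

\[ E_{\text{strong}}(N) = \frac{T(1)}{N\,T(N)}, \]

so >50% efficiency at 32 GPUs corresponds to a speed-up of more than \(16\times\) over one GPU.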

Any questions?